42 research outputs found

    Online estimation of the geometric median in Hilbert spaces : non asymptotic confidence balls

    Full text link
    Estimation procedures based on recursive algorithms are interesting and powerful techniques that are able to deal rapidly with (very) large samples of high dimensional data. The collected data may be contaminated by noise so that robust location indicators, such as the geometric median, may be preferred to the mean. In this context, an estimator of the geometric median based on a fast and efficient averaged non linear stochastic gradient algorithm has been developed by Cardot, C\'enac and Zitt (2013). This work aims at studying more precisely the non asymptotic behavior of this algorithm by giving non asymptotic confidence balls. This new result is based on the derivation of improved L2L^2 rates of convergence as well as an exponential inequality for the martingale terms of the recursive non linear Robbins-Monro algorithm

    A fast and recursive algorithm for clustering large datasets with kk-medians

    Get PDF
    Clustering with fast algorithms large samples of high dimensional data is an important challenge in computational statistics. Borrowing ideas from MacQueen (1967) who introduced a sequential version of the kk-means algorithm, a new class of recursive stochastic gradient algorithms designed for the kk-medians loss criterion is proposed. By their recursive nature, these algorithms are very fast and are well adapted to deal with large samples of data that are allowed to arrive sequentially. It is proved that the stochastic gradient algorithm converges almost surely to the set of stationary points of the underlying loss criterion. A particular attention is paid to the averaged versions, which are known to have better performances, and a data-driven procedure that allows automatic selection of the value of the descent step is proposed. The performance of the averaged sequential estimator is compared on a simulation study, both in terms of computation speed and accuracy of the estimations, with more classical partitioning techniques such as kk-means, trimmed kk-means and PAM (partitioning around medoids). Finally, this new online clustering technique is illustrated on determining television audience profiles with a sample of more than 5000 individual television audiences measured every minute over a period of 24 hours.Comment: Under revision for Computational Statistics and Data Analysi

    Digital search trees and chaos game representation

    Get PDF
    In this paper, we consider a possible representation of a DNA sequence in a quaternary tree, in which on can visualize repetitions of subwords. The CGR-tree turns a sequence of letters into a digital search tree (DST), obtained from the suffixes of the reversed sequence. Several results are known concerning the height and the insertion depth for DST built from i.i.d. successive sequences. Here, the successive inserted wors are strongly dependent. We give the asymptotic behaviour of the insertion depth and of the length of branches for the CGR-tree obtained from the suffixes of reversed i.i.d. or Markovian sequence. This behaviour turns out to be at first order the same one as in the case of independent words. As a by-product, asymptotic results on the length of longest runs in a Markovian sequence are obtained

    Variable length Markov chains and dynamical sources

    Full text link
    Infinite random sequences of letters can be viewed as stochastic chains or as strings produced by a source, in the sense of information theory. The relationship between Variable Length Markov Chains (VLMC) and probabilistic dynamical sources is studied. We establish a probabilistic frame for context trees and VLMC and we prove that any VLMC is a dynamical source for which we explicitly build the mapping. On two examples, the ``comb'' and the ``bamboo blossom'', we find a necessary and sufficient condition for the existence and the unicity of a stationary probability measure for the VLMC. These two examples are detailed in order to provide the associated Dirichlet series as well as the generating functions of word occurrences.Comment: 45 pages, 15 figure

    Dynamical Systems in the Analysis of Biological Sequences

    Get PDF
    The Chaos Game Representation (CGR) maps a sequence of letters taken from a finite alphabet onto the unit square in R2R^2. While it is a popular tool, few mathematical results have been proved to date. In this report, we show that the CGR gives rise to a limit measure, assuming only the input sequence is stationary ergodic. Some more precise properties are given in the i.i.d. and Markov cases. A new family of statistical tests to characterize the randomness of the inputs is proposed and analyzed. Finally, some basic properties of the CGR are used to generalize the notion of genomic signatur

    Persistent random walks, variable length Markov chains and piecewise deterministic Markov processes *

    Get PDF
    International audienceA classical random walk (S t , t ∈ N) is defined by S t := t n=0 X n , where (X n) are i.i.d. When the increments (X n) n∈N are a one-order Markov chain, a short memory is introduced in the dynamics of (S t). This so-called " persistent " random walk is nolonger Markovian and, under suitable conditions, the rescaled process converges towards the integrated telegraph noise (ITN) as the timescale and space-scale parameters tend to zero (see [11, 17, 18]). The ITN process is effectively non-Markovian too. The aim is to consider persistent random walks (S t) whose increments are Markov chains with variable order which can be infinite. This variable memory is enlighted by a one-to-one correspondence between (X n) and a suitable Variable Length Markov Chain (VLMC), since for a VLMC the dependency from the past can be unbounded. The key fact is to consider the non Markovian letter process (X n) as the margin of a couple (X n , M n) n≄0 where (M n) n≄0 stands for the memory of the process (X n). We prove that, under a suitable rescaling, (S n , X n , M n) converges in distribution towards a time continuous process (S 0 (t), X(t), M (t)). The process (S 0 (t)) is a semi-Markov and Piecewise Deterministic Markov Process whose paths are piecewise linear

    Digital search trees and chaos game representation

    Get PDF
    Version préliminaire (2006) d'un travail publié sous forme définitive (2009).International audienceIn this paper, we consider a possible representation of a DNA sequence in a quaternary tree, in which on can visualize repetitions of subwords. The CGR-tree turns a sequence of letters into a digital search tree (DST), obtained from the suffixes of the reversed sequence. Several results are known concerning the height and the insertion depth for DST built from i.i.d. successive sequences. Here, the successive inserted wors are strongly dependent. We give the asymptotic behaviour of the insertion depth and of the length of branches for the CGR-tree obtained from the suffixes of reversed i.i.d. or Markovian sequence. This behaviour turns out to be at first order the same one as in the case of independent words. As a by-product, asymptotic results on the length of longest runs in a Markovian sequence are obtained
    corecore